Abstract: MapReduce is the most prominent programming model for achieving data parallelism in Apache Hadoop, and many efforts aim to boost the computational speed of MapReduce within the Hadoop framework. In this paper, we present a MapReduce programming model, focused on the k-means clustering algorithm, that leverages the acceleration potential of the integrated GPU in a multi-node cluster environment. The model accelerates the framework by introducing parallelism inside the MapReduce functions through a modified k-means algorithm. Based on experiments on a multi-node cluster and in-depth analysis, we find that using the integrated GPU via OpenCL offers significant performance and power-efficiency gains over the original CPU-based or sequential approaches.

Keywords: Hadoop, MapReduce, OpenCL, k-means.